1 Introduction

The purpose of this project is to create a machine learning model that can predict the price of a diamond based on its characteristics. This file performs an initial exploratory analysis of the diamonds data set, which is provided in the ggplot2 package. The analysis attempts to answer the following questions:

  1. Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest carat, and is it the most expensive?

  2. Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. If we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?

  3. Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?

  4. Bigger is Better: Imagine a guy named Bob who wants to buy a pair of diamonds for his wife, and have them made into earrings for her birthday. In Bob’s mind, size (carat) is all that matters. He has $3200. He needs two diamonds with the exact same cut, color, and clarity (with very comparable depth and table values), and he wants them to be as big as possible. What size carat can he afford? If he adjusts his budget, how much does the “maximum carat” size shift? Can we plot that and fit a line to it to find the “knee in the curve”?

2 EDA

2.1 First Look

A sample, summary, and first glimpse of the diamonds data set is provided below:

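# Setup assumed by the rest of this report (not shown in the original output):
# `data` is taken to be the diamonds data set from ggplot2.
library(ggplot2)
library(dplyr)
library(plotly)

data <- diamonds

head(data)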
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
summary(data)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   J: 2808   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   I: 5422   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   H: 8304   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   F: 9542   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     E: 9797   VVS1   : 3655   Max.   :79.00  
##                                     D: 6775   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
glimpse(data)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

There are 53,940 rows of data, and each row is one observation (one diamond). There are 10 columns, and each column is a variable (feature). The features are:

names(data)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

price: price in US dollars (326 to 18823)
carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (from worst to best: Fair, Good, Very Good, Premium, Ideal)
color: diamond color (from worst to best: J, I, H, G, F, E, D)
clarity: a measurement of how clear the diamond is (from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
x: length in mm (0 to 10.74)
y: width in mm (0 to 58.9)
z: depth in mm (0 to 31.8)
depth: total depth percentage = 100 * 2 * z / (x + y) (43 to 79)
table: width of the top of a diamond relative to its widest point (43 to 95)
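
The summary above shows a minimum of 0 for x, y, and z, which cannot be a real measurement, and the depth definition can be checked against the recorded values. The following is a minimal sketch of both checks, assuming `data` is simply `ggplot2::diamonds`; the names `zero_dims` and `depth_check` are introduced here for illustration only.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Assumption: `data` in this report is ggplot2::diamonds.
data <- diamonds

# Rows where any dimension is recorded as 0 (presumably missing measurements).
zero_dims <- data %>%
  filter(x == 0 | y == 0 | z == 0)
nrow(zero_dims)

# Recompute depth as a percentage and compare it with the recorded value.
depth_check <- data %>%
  filter(x > 0, y > 0, z > 0) %>%
  mutate(
    depth_calc = 100 * 2 * z / (x + y),
    depth_diff = abs(depth_calc - depth)
  )

# Distribution of the absolute difference between recorded and recomputed depth.
summary(depth_check$depth_diff)
```

Large values of `depth_diff` would point to rows where x, y, or z may have been recorded incorrectly, which is worth knowing before any modeling.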

2.2 Best of the Best

  1. Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest carat, and is it the most expensive?

First we need to filter the data set to only the best cut, color, clarity, depth, and table.

df <- data %>% 
  filter(
    cut=="Ideal",
    color=="D",
    clarity=="IF",
    between(depth, 59, 63),
    between(table, 54, 57))

print(df)
## # A tibble: 24 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.51 Ideal D     IF       62      56  3446  5.14  5.18  3.2 
##  2  0.51 Ideal D     IF       62.1    55  3446  5.12  5.13  3.19
##  3  0.53 Ideal D     IF       61.5    54  3517  5.27  5.21  3.22
##  4  0.53 Ideal D     IF       62.2    55  3812  5.17  5.19  3.22
##  5  0.59 Ideal D     IF       60.9    57  4208  5.4   5.43  3.3 
##  6  0.56 Ideal D     IF       62.4    56  4216  5.24  5.28  3.28
##  7  0.56 Ideal D     IF       61.9    57  4293  5.28  5.31  3.28
##  8  0.63 Ideal D     IF       62.5    55  6549  5.47  5.5   3.43
##  9  0.63 Ideal D     IF       62.5    55  6607  5.5   5.47  3.43
## 10  1.04 Ideal D     IF       61.8    57 14494  6.49  6.52  4.02
## # ... with 14 more rows


The filter leaves only 24 diamonds. Let’s plot carat versus price.

ggplot(df, aes(x=carat, y=price)) +
  geom_point()


Looks like a linear model could be viable. Let’s create one.

model_df_lm <- lm(price ~ carat, df)

summary(model_df_lm)
## 
## Call:
## lm(formula = price ~ carat, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2123.5 -1264.7   245.6  1026.5  2344.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5621.9      623.1  -9.023 7.57e-09 ***
## carat        20259.9      911.8  22.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1288 on 22 degrees of freedom
## Multiple R-squared:  0.9573, Adjusted R-squared:  0.9554 
## F-statistic: 493.8 on 1 and 22 DF,  p-value: < 2.2e-16


Carat passes the t-test by a mile (p < 2e-16), and the R-squared of 0.957 isn’t too bad. Let’s plot this model now.
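
Question 1 also asks whether the trend might be exponential rather than linear. One way to compare the two, sketched here under the assumption that the filtered set is rebuilt from `ggplot2::diamonds`, is to fit `log(price)` against carat and compare both models on the original dollar scale. The names `best`, `fit_lin`, `fit_exp`, and `rmse` are illustrative, and the back-transform with `exp()` is a naive one that ignores retransformation bias.

```r
library(ggplot2)
library(dplyr)

# Assumption: same "best of the best" filter as above, applied to ggplot2::diamonds.
best <- diamonds %>%
  filter(
    cut == "Ideal", color == "D", clarity == "IF",
    between(depth, 59, 63),
    between(table, 54, 57)
  )

fit_lin <- lm(price ~ carat, data = best)       # price = a + b * carat
fit_exp <- lm(log(price) ~ carat, data = best)  # price = exp(a + b * carat)

# R-squared values on different response scales are not comparable,
# so compare root mean squared error in dollars instead.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_lin <- rmse(best$price, predict(fit_lin, best))
rmse_exp <- rmse(best$price, exp(predict(fit_exp, best)))

c(linear = rmse_lin, exponential = rmse_exp)
```

Whichever model has the lower error on these 24 points is only a rough indication; with so few observations the comparison should be read cautiously.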

pred_df_lm <- predict(model_df_lm, df)

df <- df %>% 
  mutate(pred = pred_df_lm)

ggplot(df, aes(x=carat, y=price)) +
  geom_point(color="blue") +
  geom_line(color="red", aes(y=pred))


Looks like the most expensive diamond might not be the largest one. Let’s find these two data points.

df2 <- df %>% 
  filter(
    carat == max(carat) |
      price == max(price)
  )

print(df2)
## # A tibble: 2 x 11
##   carat cut   color clarity depth table price     x     y     z   pred
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  1.07 Ideal D     IF       60.9    54 17042  6.66  6.73  4.08 16056.
## 2  1.03 Ideal D     IF       62      56 17590  6.55  6.44  4.03 15246.


Interesting! The most expensive diamond is not the largest diamond, but it has a larger depth and table than the largest carat diamond. This needs a closer look.
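
As a first closer look, depth and table can be added to the linear model to see whether they explain any of the remaining price variation. This is only a sketch under the assumption that the filtered set is rebuilt from `ggplot2::diamonds`; `best`, `fit_carat`, and `fit_full` are names introduced here, and 24 observations over a narrow depth and table range may be too few to draw firm conclusions.

```r
library(ggplot2)
library(dplyr)

# Assumption: same "best of the best" filter as above, applied to ggplot2::diamonds.
best <- diamonds %>%
  filter(
    cut == "Ideal", color == "D", clarity == "IF",
    between(depth, 59, 63),
    between(table, 54, 57)
  )

fit_carat <- lm(price ~ carat, data = best)
fit_full  <- lm(price ~ carat + depth + table, data = best)

# Depth and table coefficients: dollars per percentage point, carat held fixed.
coef(summary(fit_full))

# F-test: do depth and table together improve on the carat-only model?
anova(fit_carat, fit_full)
```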

2.3 Depth and Table

  2. Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. If we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?

First let’s create a scatter plot of price vs carat.

ggplot(data, aes(x=carat, y=price)) +
  geom_point()


Next, let’s create a scatter plot of price vs depth.

ggplot(data, aes(x=depth, y=price)) +
  geom_point()


Now let’s create a scatter plot of price vs table.

ggplot(data, aes(x=table, y=price)) +
  geom_point()
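
The scatter plots above use the full data set, so the effect of depth and table is mixed with the effect of the 4 C’s. A possible way to fix the 4 C’s, sketched here assuming `data` is `ggplot2::diamonds`, is to group diamonds that share carat, cut, color, and clarity and look at the price spread and its correlation with depth and table inside each group. The threshold of 30 diamonds per group and the name `fixed_4c` are arbitrary choices for illustration.

```r
library(ggplot2)
library(dplyr)

# Assumption: `data` in this report is ggplot2::diamonds.
data <- diamonds

# Keep only groups that share all four C's and are large enough to compare.
fixed_4c <- data %>%
  group_by(carat, cut, color, clarity) %>%
  filter(n() >= 30) %>%
  summarise(
    n = n(),
    price_range = max(price) - min(price),  # spread left after fixing the 4 C's
    cor_depth = cor(depth, price),          # NA if depth is constant in a group
    cor_table = cor(table, price),          # NA if table is constant in a group
    .groups = "drop"
  ) %>%
  arrange(desc(n))

head(fixed_4c)
```

Correlations near zero across most groups would suggest that depth and table have little influence on price once the 4 C’s are held fixed.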

2.3.1 Depth Deep Dive

A 3D scatter plot of price vs carat vs depth.

plot_ly(data=data, x=~price, y=~carat, z=~depth, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))


A 3D scatter plot of price vs carat vs depth with cut mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~cut, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))


A 3D scatter plot of price vs carat vs depth with diamond color mapped to marker color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~color, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))


A 3D scatter plot of price vs carat vs depth with clarity mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~clarity, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))

2.3.2 Table Deep Dive

A 3D scatter plot of price vs carat vs table.

plot_ly(data=data, x=~price, y=~carat, z=~table, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))


A 3D scatter plot of price vs carat vs table with cut mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~cut, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))


A 3D scatter plot of price vs carat vs table with diamond color mapped to marker color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~color, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))


A 3D scatter plot of price vs carat vs table with clarity mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~clarity, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))


2.4 Best Bang for the Buck

  3. Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?

Under development.
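
One possible starting point, offered as a sketch rather than the finished analysis, is to compare the median price per carat across the levels of cut. The widened depth and table ranges below are hypothetical placeholders, since question 2 has not yet settled on actual values, and `by_cut` is a name introduced for illustration. The same summary could be repeated for color and clarity.

```r
library(ggplot2)
library(dplyr)

# Hypothetical widened ranges; the report has not yet determined real values.
depth_range <- c(58, 64)
table_range <- c(53, 59)

by_cut <- diamonds %>%
  filter(
    between(depth, depth_range[1], depth_range[2]),
    between(table, table_range[1], table_range[2])
  ) %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    median_price_per_carat = median(price / carat),
    .groups = "drop"
  )

# One row per cut level, ordered from Fair to Ideal.
by_cut
```

Price per carat is used here only as a rough way to control for size; it does not account for the fact that carat distributions differ between cut levels.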

2.5 Bigger is Better

  4. Bigger is Better: Imagine a guy named Bob who wants to buy a pair of diamonds for his wife, and have them made into earrings for her birthday. In Bob’s mind, size (carat) is all that matters. He has $3200. He needs two diamonds with the exact same cut, color, and clarity (with very comparable depth and table values), and he wants them to be as big as possible. What size carat can he afford? If he adjusts his budget, how much does the “maximum carat” size shift? Can we plot that and fit a line to it to find the “knee in the curve”?

Under development.
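
A rough sketch of how Bob’s search might be set up is shown below. It makes a simplifying assumption: a matching pair means two diamonds with identical carat, cut, color, and clarity, and the similarity of depth and table is not enforced. The names `pair_prices`, `affordable`, `max_pair_carat`, and `budget_curve`, along with the budget grid, are illustrative choices and not part of the original analysis.

```r
library(ggplot2)
library(dplyr)

budget <- 3200  # Bob's budget for the pair, from the question

# Price of the two cheapest diamonds in every group sharing all four C's.
pair_prices <- diamonds %>%
  group_by(carat, cut, color, clarity) %>%
  filter(n() >= 2) %>%
  summarise(pair_price = sum(sort(price)[1:2]), .groups = "drop")

# Pairs within budget, largest carat first.
affordable <- pair_prices %>%
  filter(pair_price <= budget) %>%
  arrange(desc(carat), pair_price)

head(affordable)

# Largest affordable carat for a given budget (NA if nothing is affordable).
max_pair_carat <- function(b) {
  ok <- pair_prices$pair_price <= b
  if (!any(ok)) return(NA_real_)
  max(pair_prices$carat[ok])
}

# Hypothetical budget grid for exploring the "knee in the curve".
budgets <- seq(1000, 10000, by = 500)
budget_curve <- data.frame(
  budget = budgets,
  max_carat = sapply(budgets, max_pair_carat)
)

ggplot(budget_curve, aes(x = budget, y = max_carat)) +
  geom_line() +
  geom_point()
```

The first row of `affordable` gives the largest carat Bob could buy as a pair under this definition of matching, and the plot shows how that maximum shifts as the budget changes.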

3 Conclusion